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ABSTRACT 

A data warehouse cannot materialize all possible views, hence we 
must estimate quickly, accurately, and reliably the size of views 
to determine the best candidates for materialization. Many avail- 
able techniques for view- size estimation make particular statisti- 
cal assumptions and their error can be large. Comparatively, unas- 
suming probabilistic techniques are slower, but they estimate accu- 
rately and reliability very large view sizes using little memory. We 
compare five unassuming hashing-based view-size estimation tech- 
niques including Stochastic Probabilistic Counting and LogLog 
Probabilistic Counting. Our experiments show that only General- 
ized Counting, Gibbons-Tirthapura, and Adaptive Counting pro- 
vide universally tight estimates irrespective of the size of the view; 
of those, only Adaptive Counting remains constantly fast as we in- 
crease the memory budget. 

Categories and Subject Descriptors 

H. 3.2 [Information Storage and Retrieval]: Information Storage; 
G.3 [Probability and Statistics]: Probabilistic algorithms 

General Terms 

Algorithms, Performance, Experimentation, Reliability. 

Keywords 

OLAP, materialized views, view-size estimation, data warehouse, 
random hashing. 

I. INTRODUCTION 

View materialization is one of the most effective technique to im- 
prove query performance of data warehouses. Materialized views 
are physical structures which improve data access time by pre- 
computing intermediary results. Typical OLAP queries consist 
in selecting and aggregating data with grouping sets (GROUP BY 
clauses) [13]. By precomputing many plausible groupings, we can 
avoid slow responses due to aggregates over large tables. Many 
queries, such as those containing conditions (HAVING clauses) can 
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also be computed faster using these preaggregates. However, mate- 
rializing views requires additional storage space and induces main- 
tenance overhead when refreshing the data warehouse. Moreover, 
the number of views is large: there are 2 d views in a ^-dimensional 
data cube lattice [13]. Hence, one of the most important issues in 
data warehouse physical design is the selection of the views to ma- 
terialize, an NP-hard problem [14]. Most heuristics for this prob- 
lem depend on view- size estimation. 

Some view-size estimation techniques make assumptions about 
the data distribution and others are "unassuming". A common sta- 
tistical assumption is uniformity [12], but any skew in the data leads 
to an overestimate. 

Generally, while statistically assuming estimators are computed 
quickly, the most expensive step being the random sampling, their 
error can be large and it cannot be bounded a priori. We consider 
several state-of-the-art statistically unassuming estimation tech- 
niques: Probabilistic Counting [10], LogLog Probabilistic Count- 
ing [7], Adaptive Counting [6], Generalized Counting [5], and 
Gibbons-Tirthapura [11]. While relatively expensive, unassuming 
estimators tend to provide good accuracy and reliability [4]. 

To use these techniques, we need to hash rows quickly and our 
theoretical bounds require at least pairwise independent hash val- 
ues. Fortunately, while there can be several dimensions (d > 10) 
in a data cube, the number of attribute values in each dimension 
is often small compared to the available memory. Hence, we can 
hash dimensions separately, store the result in main memory, and 
combine these fully independent unidimensional hash values into 
3 -wise independent multidimensional hash values. 

Typically, as we allocate more memory, our algorithms become 
more accurate, but also slower. We are concerned with two differ- 
ent usage scenario. Firstly, we want rough estimates, with errors as 
large as 10%, as quickly as possible. In such cases, we can use tiny 
memory budgets (less than 1 MiB). Secondly, we want highly accu- 
rate estimates with errors less than 1% or 0.1%. In these instances, 
we use several megabytes of memory. 

The main result of this paper is an exhaustive theoretical and 
experimental comparisons of a wide range of unassuming view- 
size estimation techniques. We also present practical theoretical 
results on Generalized Counting, a novel algorithm. Finally, we 
make some recommendations. 

2. RELATED WORK 

Sample-based, statistically assuming estimations are typically 
fast, but can be inaccurate and can still use a lot of memory. In- 
deed, in the worst-case scenario, the histogram of the sample might 
be as large as the view size we are trying to estimate. Moreover, it 
is difficult to derive unassuming accuracy bounds since the sample 



might not be representative and the model might not be a good fit. 
However, a sample-based algorithm is expected to be an order of 
magnitude faster than an algorithm which processes the entire data 
set. Haas et al. [15] estimate the view size from the histogram of a 
sample: adaptively, they choose a different estimator based on the 
skew of the distribution. Faloutsos et al. [8] obtain results nearly as 
accurate as Haas et al., that is, an error of approximately 40%, but 
with a simpler algorithm. 

Stochastic Probabilistic Counting [10], LogLog Probabilistic 
Counting (henceforth LogLog) [7] and Adaptive Counting [6] 
have been shown to provide very accurate view- size estimations 
quickly for very large views, but their estimates assume we have 
independent hashing. Because of this assumption, their theoretical 
bound may not hold in practice. 

Gibbons and Tirthapura [11] derived an unassuming bound, for 
an algorithm we will refer to as Gibbons-Tirthapura or GT, that 
only requires pairwise independent hashing. It has been shown re- 
cently that if you have &-wise independent hashing for k > 2 the 
theoretically bound can be improved substantially [17]. Bar-Yossef 
et al. [5, Section 2] presented a new scheme which they described 
as a generalization of Probabilistic Counting, assuming only pair- 
wise independent hashing. The benefit of these new schemes is 
that as long as the random number generator is truly random and 
the hashed values use enough bits, the theoretical bounds have to 
hold irrespective of the size of the view or of other factors. We 
can be certain to have high accuracy and reliability, but what about 
speed? 

3. ESTIMATION BY MULTIFRACTALS 

We implemented the statistically assuming algorithm by Falout- 
sos et al. based on a multifractal model [8]. Given a sample, all that 
is required to learn the multifractal model is the number of distinct 
elements in the sample Fq, the number of elements in the sample 
N f , the total number of elements N, and the number of occurrences 
of the most frequent item in the sample m max . Hence, a very simple 
implementation is possible (see Algorithm 1). The memory usage 
of this algorithm is determined by the GROUP BY query on the 
sample (line 6): typically, a larger sample will lead to a more im- 
portant memory usage. 



Algorithm 1 View- size estimation using a multifractal distribution 
model. 

1 : INPUT: Fact table t containing N facts 

2: INPUT: GROUP by query on dimensions Di 1 D 2 ,...,D d 

3: INPUT: Sampling ratio < p < 1 

4: OUTPUT: Estimated size of group by query 

5: Choose a sample in t' of size N' = \_pN\ 

6: Compute #=group BY(r') 

7: let m max be the number of occurrences of the most frequent tuple 
x\,...,x d ing 

8: let Fo be the number of tuples in g 

9:k^\\ogF ] 
10: while F < F do 
11: p^(m max /N f y/ k 

12: F-lLoC)(l-(/- a (l-p)T') 
13: k^k+1 
14: P ^(m max /N) l / k 

15: RETURN: I* =0 (*)(1 - (/-(l -p)*f ) 



4. UNASSUMING ESTIMATION 

All unassuming methods presented in this paper use the same 
probabilistic idea. Whereas the initial data has unknown distribu- 
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Figure 1: Irrespective of the original data, the hashed values 
can be uniformly distributed. 

tion, if we use an appropriate random hashing method, the hashed 
values are uniformly distributed (see Fig. 1). 

4.1 Independent Hashing 

Hashing maps objects to values in a nearly random way. We 
are interested in hashing functions from tuples to [0,2 L ) where L 
is fixed (L = 32 or L = 64 in this paper). Hashing is uniform 
if P(h(x) = y) = 1/2 L for all x,y, that is, if all hashed values 
are equally likely. Hashing is pairwise independent if P{h{x\) = 
yi Ah(x 2 )=y 2 )=P(h(x l ) = yi )P(h(x 2 ) =y 2 ) = 1/4 L for allx;,y;. 
Pairwise independence implies uniformity. Hashing is &-wise inde- 
pendent if P(h(x\) = y\ A • • • A hfa) = y^) = l/2 kL for all x^yt. 
Finally, hashing is (fully) independent if it is &-wise independent 
for all k. Fully independent hashing of Fo distinct values requires 
£1(Fq) units of memory [1] and is thus impractical if Fq is large. 

We can compute 3 -wise independent hash values efficiently in a 
multidimensional data warehouse setting. For each dimension 
we build a look-up table 7J, using the attribute values of D; as keys. 
Each time we meet a new key, we generate a random number in 
[0,2 L ) and store it in the look-up table 7J. This random number 
is the hashed value of this key. This table generates (fully) inde- 
pendent hash values in amortized constant time. In a data ware- 
housing context, whereas dimensions are numerous, each dimen- 
sion will typically have few distinct values: for example, there are 
only 8,760 hours in a year. Therefore, the look-up table will of- 
ten use a few Mib or less. When hashing a tuple x\ ,x 2 , . . . in 
Di x D 2 x . . . D k , we use the value T\ (jq) T 2 (x 2 ) • • • T k (x k ) 
where is the EXCLUSIVE OR operator. This hashing is 3-wise 
independent and requires amortized constant time. Tables T( can be 
reused for several estimations: we can simultaneously estimate the 
size of a GROUP BY on D\ and D 2 , and the size of a GROUP BY on 
D 2 and D3 while using a single table T 2 . 

4.2 Probabilistic Counting 

Our version of (Stochastic) Probabilistic Counting [10] (or just 
Counting for short) is given in Algorithm 2. LogLog (see Algo- 
rithm 3) is a faster variant [7]. The main difference between the two 
algorithms is that LogLog only keeps track of the maximum num- 
ber of leading zeroes, whereas Probabilistic Counting keeps track 
of all observed numbers of leading zeroes and is thus more resilient 
to outliers in the hashing values (see Fig. 2). For the same param- 
eter M, the memory usage of the two algorithms is comparable in 
practice: Probabilistic Counting uses a M x L binary matrix and 
LogLog uses M counters to store integer values ranging from 1 to 
L — logM. Assuming independent hashing, these algorithms have 
(relative) standard error (or the relative standard deviation of the 
error) of 0.18/ y/M and 13/y/M respectively (see Fig. 3). These 
theoretical results assume independent hashing which we cannot 
realistically provide. They also require the view size to be very 
large. Fortunately, we can detect the small views. A small view 
compared to the available memory (M), will leave several of the M 
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Figure 2: Probabilistic counting methods. 

counters unused (array M in Algorithm 2). Thus, following Cai et 
al. [6], when more than 5% of the counters are unused we return a 
linear counting estimate [20] instead of the LogLog estimate: see 
last line of Algorithm 2 (henceforth Adaptive Counting). Finally, 
Alon et al. [2] presented a probabilistic counting variant using only 
pairwise independent hashing, but the error bounds are large: for 
any c > 2, the relative error is bounded by c — 1 with reliability 
1 - 2/c (an error bound of 3900% 19 times out of 20). We do not 
expect these algorithms to be very sensitive to the size of the mem- 
ory M. 

Algorithm 2 View-size estimation using Probabilistic Counting. 

1 : INPUT: Fact table t containing N facts 

2: INPUT: GROUP by query on dimensions £>i ,£) 2 , • • • 

3: INPUT: Memory budget parameter M = 2 k 

4: INPUT: Independent hash function h from d tuples to [0, 2 L ). 

5: OUTPUT: Estimated size of group by query 

6: b <— M x L matrix (initialized at zero) 

7: for tuple x E t do 

8: x' <— %Di,D 2 ,...,D d (x) {projection of the tuple} 
9: y^h{x') {hash x' to [0,2 L )} 
10: a = v mod M 

11: i <— position of the first 1-bit in [y/M\ 
12: b a ^l 
13: A^0 

14: forae{0,l,...,M-l}do 

1 5 : increment A by the position of the first zero-bit in b a $ ,b a ,i,--- 
16: RETURN: M/§2 A I M where (j) w 0.77351 



Algorithm 3 View-size estimation using LogLog and Adaptive 
Counting. 

1 : INPUT: fact table t containing N facts 

2: INPUT: GROUP by query on dimensions Di 1 D 2l ...,D d 

3 : INPUT: Memory budget parameter M = 2 k 

4: INPUT: Independent hash function h from d tuples to [0, 2 L ). 

5: OUTPUT: Estimated size of GROUP by query 

6: M <-0,0,. ..,0 



for tuple x Et do 

x' <— %Di,D 2 ,...,D d (x) {projection of the tuple} 
9: y<-h(jd) {hash*' to [0,2 L )} 
10: j <— value of the first k bits of y in base 2 
11: z <— position of the first 1-bit in the remaining L 

starts at 1) 
12: f^<-max(fW),z) 

13: (original LogLog) RETURN: a M M2^ L J M i 
where a M « 0.39701 - (2tt 2 +ln 2 2)/(48M). 

14: (Adaptive Counting) RETURN: i aMM2M L; ^ 
F 6 \-Mlog(3/M 
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where (3 is the number of 9/Cj for j — 1 , . . . , M with value zero 
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Figure 3: Standard error for Probabilistic Counting and 
LogLog as a function of the memory parameter M. 

4.3 Generalized Counting 

We modified a generalization to Probabilistic Counting [5, Sec- 
tion 2] (henceforth GC), see Algorithm 4. The tuples and hash 
values are stored in an ordered set, and since each tuple is inserted 
(line 14), the complexity of processing each tuple with respect to 
M is in O(logM). However, for small M with respect to the view 
size, most tuples are never inserted since their hash value is larger 
than the smallest M hash values (line 13). 

The original algorithm [5] used many hashing bits: L > 
3L/log | A" | where |Dj| is the number of attribute values in dimen- 
sion D[. The main problem is that the number of required bits de- 
pends on the volume of the cuboid, which is typically far larger 
than the view-size. As the next result shows, with our modified al- 
gorithm, few bits are necessary. For example, when hashing with 
L = 64 bits, using a memory budget of M — 10000, and with rela- 
tive accuracy of 8 = 0.1, we can estimate view sizes far exceeding 
anything seen in practice (2 x 10 21 facts). Moreover, we show that 
the accuracy bounds improve substantially if the hashed values are 
more than pairwise independent (see Fig. 4(a)). 

Proposition 1 For L > 1 + logF /(eM) and M > 2k > 4, Algo- 
rithm 4 estimates a view size Fq within relative precision e < 1 /2 
with reliability 1 — 5 where 5 is given by (4k/ (e 2 / 3 e 2 M))^/ 2 . 

PROOF. Suppose we have Fq distinct tuples in the GROUP BY 
and assume that Fq > M. If M < Fq, we can modify the algorithm 
so that an exact count is returned. 

First, consider the case where we overestimate the true count by 
8, that is 2 L M/max(f7Vf) > (1 +e)Fo, hence we have at least M 
hashed values smaller than 2 L M/((1 +e)Fq). Hashed values take 
integer values in [0, 2 L ). Assuming L > 1 + logFo/(eM), the prob- 
ability that a hashed value is smaller than 2 L M/ ((1 + e)Fq) is less 
than M/((l + e)F ) + 2~ L < M/((l + e)F ) + eM/(2F ) < M(2 + 
8 + 8 2 )/(2(l + e)F ) - Mp/F where p = (2 + 8 + 8 2 )/(2(l + 
8)). Let Xi for i — 1,...,Fq be 1 with probability p/Fq and 
zero otherwise. Write X = Ei=i jr -Xi, we have that X = 
Hi=l,...,F E(Xi) = Mp whereas, by pairwise independence, a 2 = 
var(X) = L-=l,...,F var(Xi) = F (Mp/F - M 2 p 2 /F$) = Mp(l - 
Mp/Fo) < Mp. By a Chernoff-Hoeffding bound [18, Theorem 2.4] 
and the &-wise independence of the X/'s, P(X > M) < P(\X — 

kMp \ k/2 ( hn 



Mp\ > M-Mp) < 



K eV 3 (\-p) 2 M 2 J ~ {e 2 /i{l-p) 2 M) ' We 

have that p < 1 and 1 - p > e/2 for 8 < 1/2, hence P(X > 



M)<(- 



k4 



k/2 



2/3 E 2 M j • Finally, observe that P(2 L M/max(f7Vf ) > (1 + 
8)F )<P(X>M). 
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Figure 4: Bound on the estimation error (19 times out of 20) 
as a function of the number of tuples kept in memory (M) with 
/>wise independent hashing. 

Algorithm 4 Generalized Counting view-size estimation. 

1 : INPUT: Fact table t containing N facts 

2: INPUT: GROUP by query on dimensions Di 1 D 2l ...,D d 

3 : INPUT: Memory budget parameter M 

4: INPUT: &-wise hash function h from d tuples to [0, 2 L ). 

5: OUTPUT: Estimated size of group by query 

6: M <— empty sorted sequence, max(fW) returns an element with largest 

hashed value 
7: f<-0 

8: for tuple x G t do 

9: xf %Di ,D 2 ,...,/) rf M {projection of the tuple} 

10: y <- ) {hash x' to [0, 2 L ) } 

11: if size(fftf )< M then 

12: insert x 1 with hashed value y 'mM 

13: else if v < max(fftf) then 

14: insert x' with hashed value y in M{x' may already be in M} 

15: if size(fW)> M then 

16: remove max(fW) from fftf 

17: RETURN: 2 L size (fftf) /max (fW) 



Similarly, suppose that we underestimate the true count by 8, 
2 L M/max(f7Vf) < (1 — e)Fq, hence we have less than M hashed 
values smaller than 2 L M/((1 —e)Fq). The probability that a 
hashed value is smaller than 2 L M/((1 — e)Fq) is less than M/((l — 
e)F ) +2- L < M/((l - e)F ) + eM/(2F ) < M(2 + e- e 2 )/(2(l - 
e)F ) = M/?/F where p = (2 + 8 - 8 2 )/(2(l - e)). Let X t for 
i = 1, . . . ,Fo be 1 with probability p/Fq and zero otherwise. Write 
X = Y*i=i,...,F x i> we nave mat X = M P whereas a 2 = Mp. Fi- 
nally, P(X <M)< P(\X-Mp\ >M-Mp)< ( e2/3{ ^ p)2M ) k/2 - 
By inspection, we see that p/(l - p) 2 < 2/e 2 , hence P(X <M)< 




which completes the proof. □ 



4.4 Gibbons-Tirthapura 

Originally, the GT algorithm was proposed in the context of data 
streams and parallel processing [1 1] (see Algorithm 5). If the view 
size is smaller than the memory parameter (M), the estimation is 
without error. For this reason, we expect GT to perform well when 
estimating small and moderate view sizes compared to the available 
memory. We can processing most tuples in (amortized) constant 
time with respect to M (line 13) using a hash table, however the 
occasional pruning of tuples requires (amortized) linear time with 
respect to M (line 16). 

The original theoretical bounds [11] assumed pairwise indepen- 
dence. However, more independent hashing, as is possible in our 
context for views with many dimensions, allow for better theoret- 
ical bounds [17] as illustrated by Fig. 4(b). Comparing Fig. 4(b) 
and 4(a), we may be tempted to conclude that GC is far superior to 
GT. We will compare them experimentally. 



Algorithm 5 Gibbons-Tirthapura view- size estimation. 



1; 


TNPTTT* Fart table t rnntainincr N farts 


2: 


INPUT: GROUP BY query on dimensions D\,D2, . . . ,Dd 


3: 


INPUT: Memory budget parameter M 


4: 


INPUT: k- wise hash function h from d tuples to [0, 2 L ). 


5: 


OUTPUT: Estimated size of GROUP BY query 


6: 


*M empty look-up table 


7: 


t ^- 


8: 


for tuple x E t do 


9: 


x? < — z>2 (x) {projection of the tuple} 


10: 


y^h(xf) {hash^ to [0,2 L )} 


11: 


j <— position of the first 1-bit in y (count starts at 0) 


12: 


if j < t then 


13: 


K' = j 


14: 


while size(f7Vf ) > M do 


15: 


t <-t + l 


16: 


prune all entries in M having value less than t 


17: 


RETURN: 2W(fW) 



Proposition 2 Algorithm 5 estimates the number of distinct tuples 
within relative precision 8, with a k-wise independent hash for k>2 
by storing M distinct tuples (M > 8k) and with reliability 1—5 
where 5 is given by 

k k ' 2 ( a k / 2 4 k / 2 \ 

- e k/3 M k/2 y (i _ a )A; + a k/2 E k( 2 k/2 - \) ) ' 

for 4k/ M < a < 1 and any k,M > 0. 

For the case where hashing is 4-wise independent, we derived a 
more concise bound [4]. 

Corollary 1 With 4-wise independent hashing, Algorithm 5 esti- 
mates the number of distinct tuples within relative precision 8 « 
5 / \[M, 19 times out of 20 for 8 small. 

5. EXPERIMENTAL RESULTS 

To benchmark the accuracy and speed of our implementation 
of the view-size estimation algorithms, we have run tests over the 
US Census 1990 data set [16] as well as on synthetic data produced 
by DBGEN [19]. The synthetic data was produced by running the 
DBGEN application with scale factor parameter equal to 2 except 
where otherwise stated. The characteristics of data sets are detailed 
in Table 1. We selected 20 and 8 views respectively from these 
data sets: all views in US Census 1990 have at least 4 dimensions 
whereas only 2 views have at least 4 dimensions in the synthetic 
data set. Statisticians sometimes define the standard error to be the 
standard deviation of the measures, but when the exact value can 
be known, it is better to use the deviation from the true value or 
y/E((X — c) 2 )/c where c is the value we try to estimate. The (rel- 
ative) standard error, defined as the standard deviation of the error, 
was computed from 20 estimates using this formula where c, the 
exact count, was computed once using brute force. 





US Census 1990 


DBGEN 


# of facts 


2458285 


13977981 


# of views 


20 


8 


# of attributes 


69 


16 


Data size 


360 MiB 


1.5 GiB 



Table 1: Characteristic of data sets. 



We used the GNU C++ compiler version 4.0.2 with the "-02" 
optimization flag on an Apple MacPro machine with 2 Dual- 
Core Intel Xeon processors running at 2.66 GHz and 2 GiB of 
RAM. No thrashing was observed. To ensure reproducibility, C++ 
source code is available freely at http://code.google.eom/p/ 



Algorithm 6 Test protocol. 

1 : for GROUP BY query q e Q do 
2: for memory budget m G M do 
3 : for random seed value r e R do 

4: Estimate the size of GROUP by q with m memory budget and 

r random seed value 
5: Save estimation results (time and estimated size) in a log file 

viewsizeestimation/. For the US Census 1990 data set, the 
hashing look-up table is a simple array since there are always 
fewer than 100 attribute values per dimension. Otherwise, for 
the synthetic DBGEN data, we used the GNU/CGI STL exten- 
sion hashjnap which is to be integrated in the C++ standard as an 
unorder ed_map: it provides amortized 0(1) inserts and queries. 
All other look-up tables are implemented using the STL map tem- 
plate which has the computational complexity of a red-black tree. 
We used comma separated (CSV) (and pipe separated files for DB- 
GEN) text files and wrote our own C++ parsing code. 

The test protocol we adopted (see Algorithm 6) has been exe- 
cuted for each unassuming estimation technique, GROUP BY query, 
random seed and memory size. At each step corresponding to those 
parameter values, we compute the estimated GROUP BY view sizes 
and time required for their computation. Similarly, for the multi- 
fractal estimation technique, we computed the time and estimated 
size for each GROUP BY, sampling ratio value and random seed. 

In Subsection 5.1, we consider the first use case: the user is sat- 
isfied with a moderate accuracy (such as 10%). In Subsection 5.2, 
we address the case where high accuracy (at least 1 %) is sought, 
maybe at the expense of memory usage and processing speed. 

5.1 Small memory budgets 

5.7.7 Accuracy 

Test over the US Census 1990 data set 

Fig. 5 represents the standard error for each unassuming estima- 
tion technique and memory size M G {16,64,256,2048}. For the 
multifractal estimation technique, we present the standard error for 
each sampling ratio p e {0.1%, 0.3%, 0.5%, 0.7%}. The X axis 
represents the size of the exact GROUP BY values and the Y axis, 
the corresponding standard error. Both of the X and Y axis are in 
a logarithmic scale. The standard error generally decreases when 
the memory budget increases. However, for small views, the er- 
ror can exceed 100% for Probabilistic Counting and LogLog: this 
is caused by a form of overfitting where many counters are not or 
barely used (see Section 4.2) when the ratio of the view size over 
the memory budget is small. In contrast, Fig. 5(a) shows that GT 
has sometimes accuracy better than 0.01% for small views. For 
the multifractal estimation technique (see Fig. 5(d)), the error de- 
creases when the sampling ratio increases. While the accuracy can 
sometimes approach 10%, we never have reliable accuracy. 

Test over synthetic data 

Similarly, we plotted the standard error for each technique, com- 
puted from the DBGEN data set (see Fig. 6). The five unassuming 
techniques have the same behaviour observed on the US Census 
data set. The model-based multifractal technique (see Fig. 6(d)) 
is especially accurate because DBGEN follows a uniform distri- 
bution [19]. For this reason, DBGEN is a poor tool to bench- 
mark model-based view-estimation techniques, but this problem 
does not carry over to unassuming techniques since they are data- 
distribution oblivious. 

We also performed experiments on large data sets (5, 10, 20 and 
30 GiB) generated by DBGEN. Table 2 shows that the accuracy is 



not sensitive to data and view sizes for small M. In addition, for 
large views, Probabilistic Counting has a small edge in accuracy. 

Table 2: Standard error over large data sets. 

(a) Probabilistic Counting 





Memory budget 


Data size 


View size 


64 


128 256 


5 GiB 


1000000 


11% 


8% 5% 


10 GiB 


2000000 


10% 


7% 6% 


20 GiB 


4000000 


8% 


6% 5% 


30 GiB 


6000000 


9% 


7% 7% 


(b) Gibbons-Tirthapura 






Memory budget 


Data size 


View size 


64 


128 256 


5 GiB 


1000000 


10% 


8% 7% 


10 GiB 


2000000 


9% 


7% 6% 


20 GiB 


4000000 


10% 


8% 6% 


30 GiB 


6000000 


14% 


8% 5% 



5.7.2 Speed 

The time needed to estimate the size of all the views by the unas- 
suming techniques is about 5 minutes for the US Census 1990 data 
set and 7 minutes for the synthetic data set. For the multifractal 
technique, all the estimates are completed in roughly 2 seconds, 
but it takes 1 minute (resp. 4 minutes) to sample 0.5% of the US 
Census data set (resp. the synthetic data set - TPC H), in part be- 
cause the data is not stored in a flat file. We ran further experiments 
on the data generated by DBGEN (with a scale factor equal to 5, 
i.e., 5 GiB of data) to highlight the time spent by each processing 
step: loading and parsing the data, hashing and computing esti- 
mated view sizes. As shown in Table 3, the running time of the al- 
gorithms is sensitive to the number of dimensions. For a low (resp. 
high) number of dimensions, relatively more time is spent reading 
data (resp. hashing data). However, the time spent hashing or read- 
ing is in turn much larger than the rest of the time spent by the 
algorithms (counting). This explains why all the unassuming esti- 
mation algorithms have similar running times and why timings are 
not sensitive to the memory parameter (M), as long as it is small. 

5.2 Large Memory Budgets 

When the memory budget is close to the view size, estimation 
techniques are not warranted. Hence, we did not use the US Census 
data set since it is too small. 

Table 3: Wall-clock running times. 

(a) Unidimensional view (view size = 7.5 x 10 5 ) 
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(b) tridimensional view (view size 


= 2.4 x 10 7 ) 
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(1): LogLog (2): Probabilistic Counting (3): Adaptive Counting 
(4): Gibbons-Tirthapura (5): Generalized Counting 
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(d) Multifractal 
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(f) Adaptive Counting 
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Figure 5: Standard error of estimation as a function of exact view size for increasing values of M (US Census 1990). 



M=16 -+- 


Af=64 X M=25e 


X .W=2048 B 


x 

B 


* x x |" 

3q b @- 

















100000 1e+06 1e+07 1e+08 



M=16 -+- 
/W=64 X 


M=256 X t 
A/=2048 B 


ieoryM=2048 - - 






X 


* 


% x X * 















1e+06 1e+07 

View size 



M=16 -+- 
M=64 X 


M=256 X theory M=2048 - - 
M=2048 B 






3 & ... B .......g,<a 











1e+06 1e+07 1e+08 



(a) Gibbons-Tirthapura 



(b) Probabilistic Counting 



(c) LOGLOG 




1e+06 1e+07 

(d) Multifractal 



M=16 -+- 


M=64 X M=256 


X /W=2048 B 


x^^r 




m 

§ 




r* x * 

3 ,B 3 @ 














W=16 -+- 


M=64 X /W=256 X /W=2048 B 


H 








x" 













1e+06 1e+07 1e+08 



(e) Generalized Counting 



(f) Adaptive Counting 



Figure 6: Standard error of estimation as a function of exact view size for increasing values of M (synthetic data set). 



5.2.7 Accuracy 

Fig. 7 shows the behavior of the five probabilistic schemes over 
a moderately small synthetic unidimensional view. While all five 
schemes have similar accuracy when the memory budget is small 
relative to the size of the view, as soon as the memory budget is 



within an order of magnitude of the view size, they differ signifi- 
cantly: LogLog and Counting are no longer reliable whereas the 
three other schemes quickly achieve nearly exact estimates. As we 
increase the memory budget, this phenomenon happens somewhat 
later with LogLog than Counting. Adaptive Counting still has a 



good accuracy for large M because it switches from LogLog esti- 
mates to linear counting estimates [20] (see Algorithm 3). The ac- 
curacy of GC is limited by the size of L. Finally, we ran some tests 
over a large view using large values of M (see Fig. 8): these values 
of M still translate in memory usages well below 1 GiB. The main 
difference with the large view being that LogLog and Adaptive 
Counting performance seems to be substantially worst than Proba- 
bilistic Counting unless we increase the number of bits (L = 64). 

5.2.2 Speed 

We also computed the time required to estimate a large view us- 
ing various memory budgets M (see Fig. 9). For small values of M 
(Af < 65536) all techniques are equally fast: most processing time 
is spent hashing and parsing the data (see Table 2). For larger val- 
ues of M, the time spent counting the hash values by GC and GT 
eventually dominates the processing time (see Table 3). Probabilis- 
tic Counting scales well with large values of M whereas LogLog 
does not slow down with increasing values of M, but their accura- 
cies do not necessarily improve either. Adaptive Counting remains 
fast and gets increasingly accurate as M becomes large. 
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(M and L large), GC, GT, and Adaptive Counting have accuracies 
better than 0.1%. For large values of M, which of GT and GC is 
more accurate depends on the number of hashing bits used (L). GC 
is the only scheme guaranteed to converge to the true view size as 
M grows. View- size estimation by sampling can take minutes when 
data is not laid out in a flat file or indexed, but the time required for 
an unassuming estimation is even higher. For small values of M, 
streaming and hashing the tuples accounts for most of the process- 
ing time so for faster estimates, we could store all hashed values in 
a bitmap (one per dimension). 

7. CONCLUSION AND FUTURE WORK 

We have provided unassuming techniques for view-size estima- 
tion in a data warehousing context. We adapted distinct count es- 
timators to the view- size estimation problem. Using the standard 
error, we have demonstrated that among these techniques, GC, 
GT, and Adaptive Counting provide stable estimates irrespective 
of the size of views and that increasing the memory usage leads to 
more accuracy. For small memory budgets, all unassuming meth- 
ods have comparable speeds. For large memory budgets, however, 
only Adaptive Counting remains constantly fast. For large view 
sizes, using more hashing bits (L = 64) is important, particularly 
when using Adaptive Counting. 

There is ample room for future work. Firstly, we plan to extend 
these techniques to other types of aggregated views (for example, 
views including HAVING clauses including icebergs [9]). Secondly, 
we want to precompute the hashed values for fast view- size estima- 
tion. Furthermore, these techniques should be tested in a material- 
ized view selection heuristic [3]. 
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Figure 9: Estimation time for a given view (four dimensions 
and 1.18 x 10 7 distinct tuples) as a function of memory budgets 
M (synthetic data set). 



6. DISCUSSION 

Our results show that Probabilistic Counting and LogLog do 
not entirely live up to their theoretical promise. For small view 
sizes relative to the available memory, the accuracy can be very 
low. One implication of this effect is that we cannot increase the 
accuracy of Probabilistic Counting and LogLog by adding more 
memory unless we are certain that all view sizes are very large. 
Meanwhile, we observed that GC, GT, and Adaptive Counting ac- 
curacies are independent of the view size and improve when more 
memory is allocated, though they also become slower, except for 
Adaptive Counting which remains constantly fast. When compar- 
ing the memory usage of the various techniques, we have to keep 
in mind that the memory parameter M can translate in different 
memory usage. The memory usage depends also on the number 
of dimensions of each view. Generally, GC and GT will use more 
memory for the same value of M than either Probabilistic Counting, 
Adaptive Counting, or LogLog, though all of these can be small 
compared to the memory usage of the look-up tables 7] used for 
3-wise independent hashing. When memory usage is not a concern 



9. REFERENCES 

[1] N. Alon, L. Babai, and A. Itai. A fast and simple randomized 

parallel algorithm for the maximal independent set problem. 

J. Algorithms, 7(4):567-583, 1986. 
[2] N. Alon, Y. Matias, and M. Szegedy. The space complexity 

of approximating the frequency moments. In STOC '96, 

pages 20-29, 1996. 
[3] K. Aouiche, P. Jouve, and J. Darmont. Clustering-based 

materialized view selection in data warehouses. In 

ADBIS'06, volume 4152 of LNCS, pages 81-95, 2006. 
[4] K. Aouiche and D. Lemire. Unassuming view-size estimation 

techniques in OLAP. In ICEIS'07, pages 145-150, 2007. 
[5] Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and 

L. Trevisan. Counting distinct elements in a data stream. In 

RANDOM '02, pages 1-10, 2002. 
[6] M. Cai, J. Pan, Y.-K. Kwok, and K. Hwang. Fast and 

accurate traffic matrix measurement using adaptive 

cardinality counting. In MineNet'05, pages 205-206, 2005. 
[7] M. Durand and P. Flajolet. Loglog counting of large 

cardinalities. In ESA'03, volume 2832 of LNCS, pages 

605-617, 2003. 
[8] C. Faloutsos, Y. Matias, and A. Silberschatz. Modeling 

skewed distribution using multifractals and the 80-20 law. In 

VLDB'96, pages 307-317, 1996. 



1000 
100 
10 

1 

0.1 
0.01 
0.001 
0.0001 



Figure 7: 

1000 




Gibbons-Tirthapura — I— 
Generalized Counting — -X— 

LogLog 

Counting B 

Adaptive Counting 



x- x- X 



10000 100000 

Memory budget 

(a) L = 32 



1000 
100 
10 

I 

1 

0.1 
0.01 
0.001 
0.0001 




Gibbons-Tirthapura — I— 
Generalized Counting — -X— 

LogLog X 

Counting B 

Adaptive Counting 



1e+006 



10000 100000 

Memory budget 

(b) L = 64 



1e+006 



Standard error accuracy for a small unidimensional view (250,000 items) as a function of memory budgets M. 

1000 



100 



10 



1 : 



0.1 




0.01 



Gibbons-Tirthapura 
Generalized Counting 
LogLog 
Counting 
t Adaptive Counting 



10 100 1000 10000 100000 1e+006 1e+007 1e+008 

Memory budget 

(a) L=32 



100 



10 



1 : 



0.1 



0.01 




Gibbons-Tirthapura 
Generalized Counting 
LogLog 
Counting 
Adaptive Counting 



10 100 1000 10000 100000 1e+006 1e+007 1e+008 

Memory budget 

(b) L=64 



Figure 8: Standard error of estimation for a given view (four dimensions and 1.18 x 10 7 distinct tuples) as a function of memory 
budgets M (synthetic data set). 



[9] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, 

and J. D. Ullman. Computing iceberg queries efficiently. In 

VLDB'98, pages 299-310, 1998. 
[10] P. Flajolet and G. Martin. Probabilistic counting algorithms 

for data base applications. Journal of Computer and System 

Sciences, 3 1(2): 182-209, 1985. 
[11] P. B. Gibbons and S. Tirthapura. Estimating simple functions 

on the union of data streams. In SPAA'01, pages 281-291, 

2001. 

[12] M. Golfarelli and S. Rizzi. A methodological framework for 
data warehouse design. In DOLAP'98, pages 3-9, 1998. 

[13] J. Gray, A. Bos worth, A. Layman, and H. Pirahesh. Data 
cube: A relational aggregation operator generalizing 
group-by, cross-tab, and sub-total. In ICDE '96, pages 
152-159, 1996. 

[14] H. Gupta. Selection of views to materialize in a data 
warehouse. In ICDT97, pages 98-112, 1997. 

[15] P. Haas, J. Naughton, S. Seshadri, and L. Stokes. 



Sampling-based estimation of the number of distinct values 
of an attribute. In VLDB'95, pages 311-322, 1995. 
[16] S. Hettich and S. D. Bay. The UCI KDD archive. 

http : / /kdd . ics . uci . edu, last checked on 23/10/2006, 
2000. 

[17] D. Lemire and O. Kaser. One-pass, one-hash n-gram count 

estimation. Technical Report TR-06-001, Dept. of CSAS, 

UNBSJ, 2006. available from 

http://arxiv.org/abs/cs.DB/0 610010. 
[18] J. Schmidt, A. Siegel, and A. Srinivasan. Chernoff-Hoeffding 

bounds for applications with limited independence. In 

SODA'93, pages 331-340, 1993. 
[19] TPC. DBGEN 2.4.0. http : / /www . tpc . org/tpch/, last 

checked on 23/10/2006, 2006. 
[20] K.-Y. Whang, B. T. Vander-Zanden, and H. M. Taylor. A 

linear-time probabilistic counting algorithm for database 

applications. ACM Trans. Database Syst, 15(2):208-229, 

1990. 



